Optimization by gradient boosting
Gradient boosting is a state-of-the-art prediction technique that
sequentially produces a model in the form of linear combinations of simple
predictors---typically decision trees---by solving an infinite-dimensional
convex optimization problem. We provide in the present paper a thorough
analysis of two widespread versions of gradient boosting, and introduce a
general framework for studying these algorithms from the point of view of
functional optimization. We prove their convergence as the number of iterations
tends to infinity and highlight the importance of having a strongly convex risk
functional to minimize. We also present a reasonable statistical context
ensuring consistency properties of the boosting predictors as the sample size
grows. In our approach, the optimization procedures are run forever (that is,
without resorting to an early stopping strategy), and statistical
regularization is basically achieved via an appropriate penalization of
the loss and strong convexity arguments.
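A minimal sketch of this functional view, here with squared-error loss, a small learning rate, and scikit-learn regression trees as the simple predictors (all of these choices are illustrative assumptions, not the exact algorithms analyzed in the paper):

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def gradient_boost(X, y, n_iter=200, lr=0.1, max_depth=3):
        """Squared-error gradient boosting: each step fits a small tree to the
        negative gradient (the residuals) and adds it to the linear combination."""
        f = np.full(len(y), y.mean())           # constant initial predictor
        trees = []
        for _ in range(n_iter):
            residuals = y - f                   # negative gradient of the squared loss
            tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
            f += lr * tree.predict(X)           # update the additive model
            trees.append(tree)
        def predict(X_new):
            return y.mean() + lr * sum(t.predict(X_new) for t in trees)
        return predict

Running the loop for more and more iterations mirrors the paper's regime of never stopping early; the penalization-based regularization it studies is not reproduced in this sketch.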
Analysis of a Random Forests Model
Random forests are a scheme proposed by Leo Breiman in the 2000's for
building a predictor ensemble with a set of decision trees that grow in
randomly selected subspaces of data. Despite growing interest and practical
use, there has been little exploration of the statistical properties of random
forests, and little is known about the mathematical forces driving the
algorithm. In this paper, we offer an in-depth analysis of a random forests
model suggested by Breiman (2004), which is very close to the original
algorithm. We show in particular that the procedure is consistent and adapts to
sparsity, in the sense that its rate of convergence depends only on the number
of strong features and not on how many noise variables are present.
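A stylized sketch of a forest in this spirit: each tree repeatedly picks a coordinate at random and splits the current cell at its midpoint down to a fixed depth, and the forest averages the tree predictions. The uniform choice of split coordinate, the fixed depth, and features rescaled to [0, 1]^d are simplifying assumptions, not the exact model studied in the paper:

    import numpy as np

    def centered_tree_predict(X, y, x, depth, rng):
        """One centered tree evaluated at x: follow x down `depth` random
        midpoint splits, then average the responses falling in x's leaf."""
        d = X.shape[1]
        lower, upper = np.zeros(d), np.ones(d)   # current cell, features assumed in [0, 1]^d
        in_cell = np.ones(len(y), dtype=bool)
        for _ in range(depth):
            j = rng.integers(d)                  # random split coordinate
            mid = (lower[j] + upper[j]) / 2      # midpoint split
            if x[j] <= mid:
                in_cell &= X[:, j] <= mid
                upper[j] = mid
            else:
                in_cell &= X[:, j] > mid
                lower[j] = mid
        return y[in_cell].mean() if in_cell.any() else y.mean()

    def centered_forest_predict(X, y, x, n_trees=100, depth=5, seed=0):
        rng = np.random.default_rng(seed)
        return np.mean([centered_tree_predict(X, y, x, depth, rng)
                        for _ in range(n_trees)])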
The Statistical Performance of Collaborative Inference
The statistical analysis of massive and complex data sets will require the
development of algorithms that depend on distributed computing and
collaborative inference. Inspired by this, we propose a collaborative framework
that aims to estimate the unknown mean of a random variable. In
the model we present, a certain number of calculation units, distributed across
a communication network represented by a graph, participate in the estimation
of this mean by sequentially receiving independent data from the underlying distribution while
exchanging messages via a stochastic matrix defined over the graph. We give
precise conditions on the matrix under which the statistical precision of
the individual units is comparable to that of a (gold standard) virtual
centralized estimate, even though each unit does not have access to all of the
data. We show in particular the fundamental role played by both the non-trivial
eigenvalues of this matrix and the Ramanujan class of expander graphs, which provide
remarkable performance for moderate algorithmic cost.
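A minimal gossip-style sketch of this setting: every unit keeps a running mean of its own observations and, at each round, mixes its estimate with those of its neighbors through a row-stochastic matrix W supported on the graph. The update rule and the names below are illustrative assumptions, not the estimator analyzed in the paper:

    import numpy as np

    def collaborative_mean(W, streams, n_rounds):
        """W: (n, n) row-stochastic matrix whose nonzero entries follow the graph.
        streams[i] yields one new observation per round for unit i.
        Each round: absorb the new sample locally, then average with neighbors."""
        n = W.shape[0]
        theta = np.zeros(n)                        # local estimates, one per unit
        for t in range(1, n_rounds + 1):
            x = np.array([next(streams[i]) for i in range(n)])
            theta += (x - theta) / t               # running-mean update on own data
            theta = W @ theta                      # message exchange over the graph
        return theta

The spectral gap of W (its non-trivial eigenvalues) governs how quickly the local estimates agree, which is why well-connected graphs such as Ramanujan expanders are attractive at moderate communication cost.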
Statistical analysis of k-nearest neighbor collaborative recommendation
Collaborative recommendation is an information-filtering technique that
attempts to present information items that are likely of interest to an
Internet user. Traditionally, collaborative systems deal with situations with
two types of variables, users and items. In its most common form, the problem
is framed as trying to estimate ratings for items that have not yet been
consumed by a user. Despite wide-ranging literature, little is known about the
statistical properties of recommendation systems. In fact, no clear
probabilistic model even exists which would allow us to precisely describe the
mathematical forces driving collaborative filtering. To provide an initial
contribution to this, we propose to set out a general sequential stochastic
model for collaborative recommendation. We offer an in-depth analysis of the
so-called cosine-type nearest neighbor collaborative method, which is one of
the most widely used algorithms in collaborative filtering, and analyze its
asymptotic performance as the number of users grows. We establish consistency
of the procedure under mild assumptions on the model. Rates of convergence and
examples are also provided.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the
Institute of Mathematical Statistics (http://www.imstat.org), DOI:
http://dx.doi.org/10.1214/09-AOS759.
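A small sketch of a cosine-type k-nearest neighbor predictor on a user-item rating matrix with missing entries. The matrix layout, the convention of treating missing ratings as zeros when computing similarities, and the value of k are illustrative assumptions, not the exact method analyzed in the paper:

    import numpy as np

    def predict_rating(R, user, item, k=10):
        """R: (n_users, n_items) array with np.nan for unobserved ratings.
        Predict R[user, item] by averaging the ratings given to `item` by
        the k users most cosine-similar to `user`."""
        filled = np.nan_to_num(R)                          # missing ratings as 0
        norms = np.linalg.norm(filled, axis=1)
        sims = filled @ filled[user] / np.maximum(norms * norms[user], 1e-12)
        raters = np.where(~np.isnan(R[:, item]))[0]        # users who rated the item
        raters = raters[raters != user]
        if raters.size == 0:
            return np.nan                                  # no information available
        top = raters[np.argsort(sims[raters])[::-1][:k]]   # k most similar raters
        return R[top, item].mean()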
Long signal change-point detection
The detection of change-points in a spatially or time ordered data sequence
is an important problem in many fields such as genetics and finance. We derive
the asymptotic distribution of a statistic recently suggested for detecting
change-points. Simulation of its estimated limit distribution leads to a new
and computationally efficient change-point detection algorithm, which can be
used on very long signals. We assess the algorithm via simulations and on
previously benchmarked real-world data sets.
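The abstract does not spell the statistic out, so the following is only a generic single change-point scan of CUSUM type, with its detection threshold calibrated by simulating the statistic under a no-change Gaussian model; every concrete choice here is an illustrative assumption:

    import numpy as np

    def cusum_scan(x):
        """Maximum standardized gap between the means of x[:t] and x[t:];
        a large value suggests a change-point near the maximizing t."""
        n = len(x)
        t = np.arange(1, n)
        csum = np.cumsum(x)[:-1]
        left, right = csum / t, (x.sum() - csum) / (n - t)
        stat = np.sqrt(t * (n - t) / n) * np.abs(left - right)
        return stat.max(), int(stat.argmax()) + 1

    def simulated_threshold(n, level=0.05, n_sim=2000, seed=0):
        """Approximate (1 - level) quantile of the scan under i.i.d. N(0, 1) noise."""
        rng = np.random.default_rng(seed)
        sims = [cusum_scan(rng.standard_normal(n))[0] for _ in range(n_sim)]
        return np.quantile(sims, 1 - level)

In this sketch the whole scan costs O(n) once the cumulative sums are formed, which is what makes very long signals tractable.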
Consistency of random forests
Random forests are a learning algorithm proposed by Breiman [Mach. Learn. 45
(2001) 5--32] that combines several randomized decision trees and aggregates
their predictions by averaging. Despite its wide usage and outstanding
practical performance, little is known about the mathematical properties of the
procedure. This disparity between theory and practice originates in the
difficulty of simultaneously analyzing both the randomization process and the
highly data-dependent tree structure. In the present paper, we take a step
forward in forest exploration by proving a consistency result for Breiman's
[Mach. Learn. 45 (2001) 5--32] original algorithm in the context of additive
regression models. Our analysis also sheds an interesting light on how random
forests can nicely adapt to sparsity.

1. Introduction. Random forests are an
ensemble learning method for classification and regression that constructs a
number of randomized decision trees during the training phase and predicts by
averaging the results. Since its publication in the seminal paper of Breiman
(2001), the procedure has become a major data analysis tool that performs well
in practice in comparison with many standard methods. What has greatly
contributed to the popularity of forests is the fact that they can be applied
to a wide range of prediction problems and have few parameters to tune. Aside
from being simple to use, the method is generally recognized for its accuracy
and its ability to deal with small sample sizes, high-dimensional feature
spaces and complex data structures. The random forest methodology has been
successfully applied to many practical problems, including air quality
prediction (winning code of the EMC data science global hackathon in 2012, see
http://www.kaggle.com/c/dsg-hackathon), chemoinformatics [Svetnik et al.
(2003)], ecology [Prasad, Iverson and Liaw (2006), Cutler et al. (2007)], 3
- …
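As a usage-level illustration of the original algorithm studied above, a random forest can be fit on a simulated additive regression model in a few lines; scikit-learn's implementation and the parameter values are assumptions for illustration, not the paper's theoretical setting:

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    rng = np.random.default_rng(0)
    X = rng.uniform(size=(1000, 10))                  # 10 features, only 2 relevant
    y = 2 * X[:, 0] + np.sin(4 * X[:, 1]) + 0.1 * rng.standard_normal(1000)

    forest = RandomForestRegressor(n_estimators=500, max_features="sqrt", random_state=0)
    forest.fit(X, y)                                  # each tree sees a bootstrap sample
    print(forest.predict(X[:5]))                      # predictions are averages over trees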